a.)
library(ggplot2)
library(plotly)
muscle <- read.table("muscle.txt")
colnames(muscle) <- c("mass", "age")
head(muscle, 10)
## mass age
## 1 106 43
## 2 106 41
## 3 97 47
## 4 113 46
## 5 96 45
## 6 119 41
## 7 92 47
## 8 112 41
## 9 92 48
## 10 102 48
ggplotly(ggplot(muscle, aes(x = mass)) + geom_histogram(bins = 15, fill = "#FF6347"))
ggplotly(ggplot(muscle, aes(x = age)) + geom_histogram(bins = 15, fill = "#FFA07A"))
We can see from the plots that mass is slightly right-skewed, while the age histogram shows a strong right skew.
ggplotly(ggplot(muscle, aes(x = age, y = mass)) + geom_point(color = "#FFC300"))
We can see from the scatter plot that the data exhibit a negative linear relationship.
b.)
library(MASS)
bxcx <- boxcox(mass ~ age, data = muscle)
lambda <- bxcx$x[which.max(bxcx$y)]
lambda
## [1] 1.151515
Since the 95% confidence interval from the Box-Cox plot contains \(\lambda = 1\) (with \(\hat{\lambda} \approx 1.15\)), we conclude that a transformation of the response is unnecessary.
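For reference, the Box-Cox family being profiled here is \[ y^{(\lambda)} = \begin{cases} \frac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0, \end{cases} \] and the 95% interval marked by `boxcox()` consists of all \(\lambda\) whose profile log-likelihood lies within \(\frac{1}{2}\chi^2_{(0.95;\,1)} \approx 1.92\) of the maximum. Since \(\hat{\lambda} \approx 1.15\), the identity transformation \(\lambda = 1\) falls comfortably inside that interval.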
c.)
# Build the model
modelMM <- lm(mass ~ age, data = muscle)
sumModelMM <- summary(modelMM)
# Coefficients and their standard errors
sumModelMM$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 156.346564 5.51226249 28.36341 1.085415e-35
## age -1.189996 0.09019725 -13.19326 4.123987e-19
# MSE
MSE <- sum(sumModelMM$residuals^2) / (nrow(muscle) - 2)
MSE
## [1] 66.80082
# Degrees of Freedom for MSE
sumModelMM$df[2]
## [1] 58
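As a sanity check, the MSE agrees with the residual sum of squares divided by its degrees of freedom: \[ MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{60 - 2} = \frac{3874.4}{58} \approx 66.80. \]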
d.) The estimated regression line is: \[ \hat{Y}_i = 156.347 - 1.190X_i \]
ggplotly(
ggplot(muscle, aes(x = age, y = mass)) + geom_point(color = "#FFC300") + geom_abline(
intercept = 156.346564,
slope = -1.189996,
col = "#003cff"
)
)
Our fitted regression line appears to fit the data well.
e.)
# Fitted values for the 6th and 16th cases
modelMM$fitted.values[c(6,16)]
## 6 16
## 107.55675 90.89681
# Residuals for the 6th and 16th cases
modelMM$residuals[c(6,16)]
## 6 16
## 11.443252 -3.896811
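As a check against the fitted line, case 6 has age \(X_6 = 41\) and mass \(Y_6 = 119\), so \[ \hat{Y}_6 = 156.347 - 1.190(41) \approx 107.557, \qquad e_6 = Y_6 - \hat{Y}_6 = 119 - 107.557 \approx 11.443, \] matching the output above.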
f.) The Normal Error Model is written as the following: \[ Y_i = \beta_0 + \beta_1X_i + \epsilon_i \] with \(i = 1, \dots, n\) where the error terms are independently and identically distributed (i.i.d.) \(N(0, \sigma^2)\) random variables.
# Residuals vs fitted values
ggplot() + aes(x = modelMM$fitted.values, y = modelMM$residuals) + geom_point() + labs(x = "Fitted Values", y = "Residuals") + geom_hline(yintercept = 0)
## Normal Q-Q plot of the residuals
ggplot(data.frame(resid = modelMM$residuals), aes(sample = resid)) + stat_qq() + stat_qq_line() + labs(x = "Theoretical Quantiles", y = "Sample Quantiles", title = "Normal Q-Q")
Both residual plots suggest that the normality assumption is met. The Residuals vs. Fitted Values plot shows no pattern, with random scatter throughout. The Normal Q-Q plot follows the reference line fairly closely, with no strong deviations that would indicate skewness or heavy tails.
g.) We construct a 99% confidence interval for the intercept \(\beta_0\).
xbar <- mean(muscle$age)
as.numeric(
c(
modelMM$coefficients[1] - qt(1 - 0.01 / 2, nrow(muscle) - 2) * sumModelMM$coefficients[1, 2],
modelMM$coefficients[1] + qt(1 - 0.01 / 2, nrow(muscle) - 2) * sumModelMM$coefficients[1, 2]
)
)
## [1] 141.6658 171.0273
We are 99% confident that the true value for \(\beta_0\) is contained within the interval (141.6658, 171.0273).
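This interval follows the standard form, with critical value \(t(0.995;\, 58) \approx 2.663\): \[ \hat{\beta}_0 \pm t(1 - \alpha/2;\, n-2)\, s(\hat{\beta}_0) = 156.347 \pm (2.663)(5.512) \approx 156.347 \pm 14.68. \]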
h.) We test the hypotheses: \[ H_0: \beta_1 \geq 0 \\ H_a: \beta_1 < 0 \] with test statistic \[ T^* = \frac{\hat{\beta}_1 - \beta_1^{(0)}}{s(\hat{\beta}_1)}, \qquad T^* \sim t_{(n-2)} \text{ under } H_0, \] where \(\beta_1^{(0)} = 0\). We reject \(H_0\) if and only if \(T^* < t(\alpha;\, n-2)\).
Tstar <- sumModelMM$coefficients[2,1] / sumModelMM$coefficients[2,2]
Tstar
## [1] -13.19326
Tstar < qt(0.01, nrow(muscle) - 2)
## [1] TRUE
Since our test statistic is less than the critical value, we reject the null hypothesis and conclude that there is a negative linear association between muscle mass and age for women.
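Plugging in the estimates from part c: \[ T^* = \frac{-1.190 - 0}{0.0902} \approx -13.19 < t(0.01;\, 58) \approx -2.392. \]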
i.)
predict(
modelMM,
newdata = data.frame(
age = 60
),
interval = "prediction",
level = 0.95
)
## fit lwr upr
## 1 84.94683 68.45067 101.443
We predict with 95% confidence that the muscle mass for a woman aged 60 will be contained within the interval (68.451, 101.443).
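The interval `predict()` returns has the form \[ \hat{Y}_h \pm t(1 - \alpha/2;\, n-2)\, s\{\mathrm{pred}\}, \qquad s^2\{\mathrm{pred}\} = MSE \left( 1 + \frac{1}{n} + \frac{(X_h - \bar{X})^2}{\sum_i (X_i - \bar{X})^2} \right), \] with point prediction \(\hat{Y}_h = 156.347 - 1.190(60) \approx 84.95\).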
j.) We test the hypotheses: \[ H_0: \beta_1 = 0 \\ H_a: \beta_1 \neq 0 \] with test statistic \[ F^* = \frac{MSR}{MSE}, \qquad F^* \sim F_{(1,\, n-2)} \text{ under } H_0. \] We reject \(H_0\) if \(F^* > F(1 - \alpha;\, 1, n-2)\).
anova(modelMM)
## Analysis of Variance Table
##
## Response: mass
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 11627.5 11627.5 174.06 < 2.2e-16 ***
## Residuals 58 3874.4 66.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
At \(\alpha = 0.01\), with a p-value less than \(2.2 \times 10^{-16}\), we reject \(H_0\) and conclude that there is a linear association between the amount of muscle mass and age.
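In simple linear regression the overall F statistic is the square of the slope t statistic, so this result is consistent with part h: \[ F^* = \frac{MSR}{MSE} = \frac{11627.5}{66.8} \approx 174.06 = (-13.193)^2, \] which far exceeds the critical value \(F(0.99;\, 1, 58) \approx 7.1\).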
k.)
# The proportion of total variation explained is measured by R^2
sumModelMM$r.squared
## [1] 0.7500668
cor(muscle$age, muscle$mass)
## [1] -0.866064
75.01% of the variation in muscle mass is explained by the variation in age. We can also see that muscle mass and age have a correlation of -0.8661, indicating a strong negative linear relationship.
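These two quantities are consistent: in simple linear regression the correlation is the signed square root of \(R^2\), \[ r = \operatorname{sign}(\hat{\beta}_1)\sqrt{R^2} = -\sqrt{0.7501} \approx -0.8661. \]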
For the plot in the top left, the distribution is right-skewed, since the points bend away from the line and the sample quantiles are concentrated at lower values. The top-right plot shows lighter tails and data more concentrated in the center, indicating a narrower distribution. The bottom-left plot appears bimodal, given the two clusters in the lower-left and upper-right of the Q-Q plot. The final plot, in the bottom right, looks approximately normal, with no obvious signs of skewness.